More advanced Multivariate

Susan Holmes and Wolfgang Huber

July 10th 2017

Advanced multivariate

Heterogeneity


Lecture on more advanced multivariate approaches

Multidimensional Scaling: Principal Coordinate Analysis

Suppose we are given a matrix of dissimilarities or distances and we want to build a useful map of the observations.

require(graphics)
data(eurodist)
# look at raw distances
as.matrix(eurodist)[1:7,1:7]
##            Athens Barcelona Brussels Calais Cherbourg Cologne Copenhagen
## Athens          0      3313     2963   3175      3339    2762       3276
## Barcelona    3313         0     1318   1326      1294    1498       2218
## Brussels     2963      1318        0    204       583     206        966
## Calais       3175      1326      204      0       460     409       1136
## Cherbourg    3339      1294      583    460         0     785       1545
## Cologne      2762      1498      206    409       785       0        760
## Copenhagen   3276      2218      966   1136      1545     760          0
# graphical representation
heatmap(as.matrix(eurodist))

These were computed as actual distances across land, so we know that we could find a 2-3 dimensional map that would represent the data well.

# Ekman's perceived color similarities between 14 wavelengths
eck=read.table("http://bios221.stanford.edu/data/eckman.txt",header=TRUE)
nc=nrow(eck)
eck[1:9,1:9]
##   w434 w445 w465 w472 w490 w504 w537 w555 w584
## 1 0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02
## 2 0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02
## 3 0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02
## 4 0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02
## 5 0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07
## 6 0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14
## 7 0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22
## 8 0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33
## 9 0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00
require(ade4)
# 1-eck turns similarities into dissimilarities; quasieuclid makes them Euclidean
queck=quasieuclid(as.dist(1-eck))
eck.pco=dudi.pco(queck,scannf=F,nf=2)
names(eck.pco)
##  [1] "eig"  "rank" "nf"   "cw"   "tab"  "li"   "l1"   "c1"   "co"   "lw"  
## [11] "call"
scatter(eck.pco,posi="bottomright")

Other available analysis and plotting options

require(vegan)
require(ggplot2)
# barplot screeplot of a vector of eigenvalues
ggscree=function(out){
  n=length(out)
  df=data.frame(eig=out, xs=1:n)
  ggplot(data=df, aes(x=xs, y=eig)) +
    geom_bar(stat="identity",width=0.3,color="red",fill="orange")
}
res=cmdscale(as.dist(1-eck))
res
##            [,1]        [,2]
## w434 -0.2137161 -0.41852576
## w445 -0.2562012 -0.41065436
## w465 -0.4119890 -0.30925977
## w472 -0.4369586 -0.27266935
## w490 -0.4388604  0.07518594
## w504 -0.3364868  0.37262279
## w537 -0.2429950  0.47735969
## w555 -0.1893125  0.48826989
## w584  0.2418131  0.29739050
## w600  0.4016882  0.15279550
## w610  0.4986351 -0.02861306
## w628  0.4956801 -0.10485148
## w651  0.4582372 -0.14801944
## w674  0.4304660 -0.17103108
res=cmdscale(as.dist(1-eck),eig=TRUE)
names(res)
## [1] "points" "eig"    "x"      "ac"     "GOF"
res$eig
##  [1]  1.982134e+00  1.299333e+00  4.409240e-01  3.739315e-01  1.584615e-01
##  [6]  1.030075e-01  4.151723e-02  3.166641e-02  1.846570e-02  4.165058e-03
## [11]  1.345088e-03  2.151057e-16 -2.673286e-02 -4.743236e-02
out=res$eig
col14=rainbow(14)
ggscree(out)

MDS = data.frame(PCo1 = res$points[,1], PCo2 = res$points[,2])
ggplot(data = MDS, aes(PCo1, PCo2)) +
  geom_point(size=4,color = col14)

How does the method work?

Let’s do our forwards-backwards trick, starting with forwards: if we knew the original coordinates \(X\), what would \(D\) look like?

From X to Distances D

Given a Euclidean distance matrix \(D\) between the observation rows, call \(B=XX'\). If \(D^{(2)}\) is the matrix of squared distances between the rows of \(X\) in Euclidean coordinates, we can show that \[-\frac{1}{2} HD^{(2)} H=B,\]

where \[H=I-\frac{1}{n}\mathbf{11}'\] is the centering matrix.

\[d_{i,j} = \sqrt{(x_i^1 - x_j^1)^2 + \dots + (x_i^p - x_j^p)^2}. \mbox{ and } -\frac{1}{2} HD^{(2)} H=B\]

Backward from D to X

We can go backwards from a matrix \(D\) to \(X\) by taking the eigendecomposition of \(B\) in much the same way that PCA provides the best rank \(r\) approximation for data by taking the singular value decomposition of \(X\), or the eigendecomposition of \(XX'\).

\[X^{(r)}=US^{(r)}V'\mbox{ with } S^{(r)}= \begin{pmatrix} s_1 &0 & 0 &0 &...\\ 0&s_2&0 & 0 &...\\ 0& 0& ...& ... & ...\\ 0 & 0 & ... & s_r &...\\ ...& ...& ...& 0 & 0 \\ \end{pmatrix} \] This provides the best approximate representation in a Euclidean space of dimension \(r\). The algorithm provides points in a Euclidean space whose interpoint distances approximate those given by \(D\).

Classical MDS Algorithm

In summary, given an \(n \times n\) matrix of interpoint distances \(D\), one can solve for points achieving these distances by:

  1. Double centering the squared interpoint distance matrix: \(B = -\frac{1}{2}H D^{(2)} H\).

  2. Diagonalizing \(B\): \(B = U \Lambda U^T\).

  3. Extracting \(\tilde{X}\): \(\tilde{X} = U \Lambda^{1/2}\).
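These steps can be carried out directly in R; here is a minimal sketch on the eurodist data loaded earlier (up to the sign of each axis, the coordinates agree with cmdscale):

```r
# Classical MDS by hand: double center, diagonalize, extract coordinates
D <- as.matrix(eurodist)
n <- nrow(D)
H <- diag(n) - matrix(1/n, n, n)          # centering matrix
B <- -0.5 * H %*% D^2 %*% H               # step 1: double centering
e <- eigen(B, symmetric = TRUE)           # step 2: diagonalize B
Xtilde <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))  # step 3, k = 2
# agrees with cmdscale(eurodist, k = 2) up to axis sign flips
```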

Examples of output

Screeplot for the Europe Data

library(pheatmap)
load("../../Books/CUBook/data/distEuroN.RData")
seteuro=as.matrix(distEuroN)[1:13,1:13]
pheatmap(seteuro,cluster_rows = TRUE,treeheight_row=0.01,
treeheight_col=0.03,fontsize_col = 8, cellwidth=13,cellheight=13)

Given these distances between cities, multidimensional scaling (MDS) provides a 'map' of their relative locations. Of course, in this case the distances were originally measured across land, so we actually expect to find a two-dimensional map that represents the data well; in general our maps will not be as accurate.

res=cmdscale(distEuroN,eig=TRUE)
n=8
out=data.frame(k=1:n,eig=res$eig[1:n])
ggplot(data=out, aes(x=k, y=eig)) +
geom_bar(stat="identity",width=0.5,color="orange",fill="pink")

Position of Cities

library("ggrepel")
MDS = data.frame(PCo1 = res$points[,1], PCo2 = res$points[,2],
        labs=rownames(res$points))
ggplot(data = MDS, aes(x=PCo1, y=PCo2,label=labs) ) +
geom_text_repel(size=3)+ xlim(-2500, 2500)+ylim(-1500,1500)

Notice the orientation: the MDS solution is only defined up to rotations and reflections. We also do not have 'variables' to interpret.

load("../../Books/CUBook/data/citiesX.RData")
X=Xcoord
DdotD=as.matrix(dist(X)^2)

The relative distances do not depend on the point of origin of the data and we will center the data by using a matrix \(H\): the centering matrix defined as \(H=I-\frac{1}{n}{\mathbf{11}}^t\).

Let’s check the centering property of \(H\) using R:

X=Xcoord
H=diag(rep(1,20))-(1/20)*matrix(1,20,20)
Xc=sweep(X,2,apply(X,2,mean))
Xc[1:2,]
##           [,1]      [,2]
## [1,] -8.701065 17.305235
## [2,] -5.284465 -4.228165
HX=H%*%X
HX[1:2,]
##           [,1]      [,2]
## [1,] -8.701065 17.305235
## [2,] -5.284465 -4.228165
apply(HX,2,mean)
## [1] -4.240895e-15 -7.158336e-16

Consider the points centered at the origin given by the \(HX\) matrix and compute its cross product \(B_2=HX(HX)^t\); check that \(B_1=-\frac{1}{2}H(D\bullet D)H\) and \(B_2\) are equal.

B1=-0.5*H%*%DdotD%*%H
B2=HX%*%t(HX)
max(abs(B1-B2))
## [1] 2.273737e-13

From \(D\bullet D\) to \(X\) using singular vectors

We can go backwards from a matrix \(D\bullet D\) to \(X\) by taking the eigendecomposition of \(B\) as defined above. This also enables us to choose how many coordinates, or columns, we want for the \(X\) matrix, in much the same way that PCA provides the best rank \(r\) approximation for data.

Note: We actually have the choice between using the singular value decomposition of \(HX\) or the eigendecomposition of \(HX(HX)^t\).

\[HX^{(r)}=US^{(r)}V'\mbox{ with } S^{(r)}= \begin{pmatrix} s_1 &0 & 0 &0 &...\\ 0&s_2&0 & 0 &...\\ 0& 0& ...& ... & ...\\ 0 & 0 & ... & s_r &...\\ ...& ...& ...& 0 & 0 \\ \end{pmatrix}\] This provides the best approximate representation in a Euclidean space of dimension \(r\). The algorithm gives us the coordinates of points whose interpoint distances approximate those given by the matrix \(D\).

Robust MDS: Nonmetric Multidimensional Scaling

require(vegan)
require(ggplot2)
res=metaMDS(as.dist(1-eck))
## Run 0 stress 0.02310251 
## Run 1 stress 0.02310251 
## ... New best solution
## ... Procrustes: rmse 4.691566e-06  max resid 8.274599e-06 
## ... Similar to previous best
## Run 2 stress 0.02310251 
## ... Procrustes: rmse 1.897753e-06  max resid 2.88282e-06 
## ... Similar to previous best
## (runs 3 through 20 repeat with the same stress, 0.02310251)
## *** Solution reached
col14=rainbow(14)
NMDS = data.frame(PCo1 = res$points[,1], PCo2 = res$points[,2])
ggplot(data = NMDS, aes(PCo1, PCo2)) +
  geom_point(size=4,color = col14)

Robust versions of MDS: NMDS

Multidimensional scaling aims to minimize the difference between the squared distances as given by \(D\bullet D\) and the squared distances between the points with their new coordinates.

Unfortunately, squared distances tend to be sensitive to outliers and one single large value can skew the whole analysis.

We need procedures that are less dependent on the actual values of the distances but still take into account the distances’ relative rankings.

Methods based on ranks are much more robust than those based on the actual values in general and many nonparametric tests are based on this reduction of data to their ranks.

Robust ordination, called nonmetric multidimensional scaling (NMDS for short), only attempts to embed the points in a new space such that the order of the reconstructed distances in the new map is the same as the ordering of the original distance matrix.

Nonmetric MDS looks for a transformation \(f\) of the given dissimilarities in the matrix \(d\) and a set of coordinates in a low dimensional space (the map) such that the distance in this new map is \(\tilde{d}\) and \(f(d)\thickapprox \tilde{d}\). The quality of the approximation can be measured by \[STRESS^2=\frac{\sum(f(d)-\tilde{d})^2}{\sum d^2}.\]

NMDS is not sequential, in the sense that we have to specify the underlying dimensionality at the outset, and the optimization is run to best reconstruct the distances in that number of dimensions. There is no notion of percentage of variation explained by individual axes as provided in PCA. However, we can make a screeplot-like plot by running the program for successive values of \(k\) (\(k=1, 2, 3, 4, \ldots\)) and looking at how well the STRESS drops. Here is an example of looking at these successive approximations and their goodness of fit.

Several replicates at each dimension are run to evaluate the stability of the STRESS. We see that the STRESS drops dramatically after the first axis, indicating that a one-dimensional solution may be appropriate here; it is easier to use two dimensions, however, because that keeps the labels from overlapping.

library(vegan)
# replicate NMDS fits for k = 1, ..., kmax and record the STRESS values
nmds.stress <- function(x, sim=20, kmax=4) {
  stressp = matrix(0, nrow=sim, ncol=kmax)
  for (i in 1:kmax)
    stressp[,i] = replicate(sim, metaMDS(x, autotransform=F, k=i)$stress)
  return(stressp)
}
bootscree=nmds.stress(distEuroN,sim=100)
library(reshape)
colnames(bootscree)=paste("Dim",1:4,sep=" ")
dfm=melt(data.frame(bootscree),variable_name="stress")
ggplot(dfm,aes(x = stress,y = value))  + geom_boxplot()+
   xlab("")+ylab("Stress")

### Compare the distances from the approximations.
out=monoMDS(distEuroN,k=2,autotransform=FALSE)
stressplot(out,distEuroN,pch="")

set.seed(499245)
out2=monoMDS(distEuroN,k=2,autotransform=FALSE)
### We flip the coordinate signs.
out2$points[,1]=-(out2$points[,1])
out2$points[,2]=-(out2$points[,2])
NMDSout = data.frame(NMDS1 = out2$points[,1], NMDS2 = out2$points[,2],
        labs=rownames(out2$points))
ggplot(data = NMDSout, aes(NMDS1, NMDS2,label=labs) ) +
               geom_text_repel(aes(NMDS1, NMDS2, label = labs)) +
               xlim(-2, 1.8)+ylim(-1.5,1.5)+theme_bw()

Let’s compare the output of the two different MDS programs, the simple least squares approximation and the nonparametric rank approximation.

MDS = data.frame(PCo1 = -res$points[,1], PCo2 = -res$points[,2],
        labs=rownames(res$points))
ggplot(data = MDS, aes(PCo1,PCo2,label=labs) ) +
         geom_text_repel(aes(PCo1, PCo2, label = labs)) +
         xlim(-2200, 2300)+ylim(-1500,1500)+theme_bw()

Many different distances give different multidimensional scaling projections

Example of all the choices

Correspondence Analysis: A Weighted PCA for Contingency Tables

What is a contingency table?

\[{\small \begin{array}{rrrrrrrr} \hline & black & blue & green & grey & orange & purple & white \\ \hline quiet & 27700 & 21500 & 21400 & 8750 & 12200 & 8210 & 25100 \\ angry & 29700 & 15300 & 17400 & 7520 & 10400 & 7100 & 17300 \\ clever & 16500 & 12700 & 13200 & 4950 & 6930 & 4160 & 14200 \\ depressed &\hskip-0.3cm 14800 & 9570 & 9830 & 1470 & 3300 & 1020 & 12700 \\ happy & 193000 & 83100 & 87300 & 19200 & 42200 & 26100 & 91500 \\ lively & 18400 & 12500 & 13500 & 6590 & 6210 & 4880 & 14800 \\ perplexed &\hskip-0.3cm 1100 & 713 & 801 & 189 & 233 & 152 & 1090 \\ virtuous & 1790 & 802 & 1020 & 200 & 247 & 173 & 1650 \\ \hline \end{array} }\]

Categorical Data Representations

(the long version representation)

Indicator variables: \[\begin{array}{lrrrr} Patient & Mutation 1 & Mutation 2 & Mutation 3 & \ldots \\ AHY789 & 0 & 0 & 0 \\ AHX717 & 1 & 0 & 1 \\ AHX543 & 1 & 0 & 0 \\ \end{array} \] are transformed into square co-occurrence tables, such as the Patient \(\times\) Patient matrix:

\[\begin{array}{lrrrrrrr} Patient & AHY789 & AHX717 & AHX543 \ldots & \\ AHY789 & 853 & 29 & 10 \\ AHX717 & 29 & 853 & 52 \\ AHX543 & 10 & 52 & 853 \\ \end{array} \]
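The passage from the long indicator representation to such square tables is just a matrix crossproduct. A small sketch with made-up 0/1 data (the patient IDs are from the table above; the counts are invented for illustration):

```r
# Sketch: crossproducts of an indicator matrix give co-occurrence counts
ind <- matrix(c(0, 0, 0,
                1, 0, 1,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("AHY789", "AHX717", "AHX543"),
                              paste0("Mutation", 1:3)))
ind %*% t(ind)   # Patient x Patient: mutations shared by each pair of patients
t(ind) %*% ind   # Mutation x Mutation: patients sharing each pair of mutations
```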

Hair Color, Eye Color

require(ade4)
HairColor=HairEyeColor[,,2]
HairColor
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
chisq.test(HairColor)
## 
##  Pearson's Chi-squared test
## 
## data:  HairColor
## X-squared = 106.66, df = 9, p-value < 2.2e-16

Conclusion: the data are not independent; the categories show a pattern of dependency. What can we say about them?

Independence: computationally

rowsums=as.matrix(apply(HairColor,1,sum))
rowsums
##       [,1]
## Black   52
## Brown  143
## Red     37
## Blond   81
colsums=as.matrix(apply(HairColor,2,sum))
colsums
##       [,1]
## Brown  122
## Blue   114
## Hazel   46
## Green   31
HCexp=round(rowsums%*%t(colsums)/sum(colsums))
Exp = outer(apply(HairColor, 1, sum), apply(HairColor, 2, sum))/sum(HairColor)
# Here is actually how the chisquare statistic is computed:
# it reproduces the X-squared value reported by chisq.test above (106.66)
sum((HairColor - Exp)^2/Exp)
HairColor-HCexp
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    16  -10    -3    -3
##   Brown    10  -18     8     0
##   Red       2   -6     2     3
##   Blond   -28   34    -7     0
require(vcd)
mosaic(HairColor,shade=TRUE)

What special property does the HCexp (Exp) matrix have?

mosaic(HCexp,shade=TRUE)

Independence: mathematically

Suppose we are comparing two categorical variables, such as (hair color, eye color) or (color, emotion).

Counts in the table can be compared to the margin products.

For an \(I \times J\) contingency table with a total sample size of \(n=\sum_{i=1}^I \sum_{j=1}^J n_{ij}= n_{\cdot \cdot}\), independence means

\[n_{ij} \doteq \frac{n_{i\cdot}}{n} \frac{n_{\cdot j}}{n}\, n,\] which can also be written \(N \doteq n\, r c'\), where \[r=\frac{1}{n} N {\mathbf 1}_J \;\mbox{ and }\; c=\frac{1}{n} N' {\mathbf 1}_I \] are the vectors of row and column frequencies (masses).

The departure from independence is measured by the \(\chi^2\) statistic \[ {\cal X}^2=\sum_{i,j} \frac{\left(n_{ij}-\frac{n_{i\cdot}n_{\cdot j}}{n}\right)^2}{\frac{n_{i\cdot}n_{\cdot j}}{n}}\]

Correspondence Analysis is like PCA using \(\chi^2\) distances

Many implementations:

  - dudi.coa in ade4
  - cca in vegan
  - ordinate in phyloseq
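All of these ultimately rest on the SVD of the matrix of standardized (Pearson) residuals. A minimal sketch on the hair/eye table, checking that the total inertia times \(n\) recovers the \(\chi^2\) statistic:

```r
# Sketch: CA via the SVD of the standardized Pearson residuals
N <- as.matrix(HairEyeColor[,,2])
n <- sum(N)
P <- N / n                          # correspondence matrix
r <- rowSums(P); cm <- colSums(P)   # row and column masses
S <- diag(1/sqrt(r)) %*% (P - r %*% t(cm)) %*% diag(1/sqrt(cm))
sv <- svd(S)
n * sum(sv$d^2)   # total inertia times n equals the chi-square statistic
```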

library("ade4")
HairColor = HairEyeColor[,,2]
chisq.test(HairColor)
## 
##  Pearson's Chi-squared test
## 
## data:  HairColor
## X-squared = 106.66, df = 9, p-value < 2.2e-16
res.coa=dudi.coa(HairColor,scannf=FALSE,nf=3)
names(res.coa)
scatter(res.coa)
s.label(res.coa$c1,boxes=FALSE)
s.label(res.coa$li,add.plot=TRUE)

CCA in vegan

library(vegan)
res.ca=vegan::cca(HairColor)
plot(res.ca,scaling=3)

Example: Plato’s Sentence Endings

Table from Brandwood and Cox


The dates at which Plato wrote these various 'books' are not known. We take the sentence endings and use those pattern frequencies as the data.

The first (earliest) is known to be the Republic. The last (latest) is known to be the Laws.

platof=read.table("../../Books/CUBook/data/platof.txt",header=T)
platof[1:8,]
##       Rep Laws Crit Phil Pol Soph Tim
## uuuuu  42   91    5   24  13   26  18
## -uuuu  60  144    3   27  19   33  30
## u-uuu  64   72    3   20  24   31  46
## uu-uu  72   98    2   25  20   24  14
## uuu-u  79  113   10   38  25   22  26
## uuuu-  76  144    6   46  22   23  27
## --uuu  79  102    5   41  25   30  26
## -u-uu  83   68    3   14  18   37  26
res.plato=dudi.coa(platof,scannf=FALSE,nf=2)
scatter(res.plato)

s.label(res.plato$co)
s.label(res.plato$li,add.plot=T)

names(res.plato)
##  [1] "tab"  "cw"   "lw"   "eig"  "rank" "nf"   "c1"   "li"   "co"   "l1"  
## [11] "call" "N"
sum(res.plato$eig)
## [1] 0.132618
round(res.plato$eig,1)
## [1] 0.1 0.0 0.0 0.0 0.0 0.0

More About Gradients

The Boomer Lake example: Arch Effect for Ecologists

Two species count matrices

lakelike
##       plant1 plant2 plant3 plant4 plant5 plant6 plant7 plant8 plant9
## loc1       5      4      0      0      1      0      0      0      0
## loc2       5      6      5      4      2      2      0      1      2
## loc3       3      4      6      4      4      4      1      0      0
## loc4       2      4      4      7      4      3      2      2      2
## loc5       1      3      4      4      6      5      3      2      2
## loc6       0      2      2      3      4      5      5      3      2
## loc7       0      0      1      2      3      4      6      7      3
## loc8       1      0      0      2      2      4      5      6      4
## loc9       1      1      0      0      2      2      4      4      5
## loc10      1      0      0      0      1      2      2      3      5
##       plant10 plant11 plant12 plant13 plant14 plant15
## loc1        0       0       0       1       0       0
## loc2        0       0       2       0       0       1
## loc3        1       0       0       1       0       0
## loc4        3       0       1       1       0       1
## loc5        0       0       0       0       1       1
## loc6        1       0       0       0       0       1
## loc7        2       2       0       0       1       0
## loc8        4       4       2       0       1       0
## loc9        4       3       2       1       0       1
## loc10       5       4       3       2       1       1
lakelikeh
##       plant1 plant2 plant3 plant4 plant5 plant6 plant7 plant8 plant9
## loc1       5      4      0      0      1      0      0      0      0
## loc2       5      6      5      4      2      2      0      1      2
## loc3       3      4      6      4      4      4      1      0      0
## loc4      20     40     40     70     40     30     20     20     20
## loc5       1      3      4      4      6      5      3      2      2
## loc6       0      2      2      3      4      5      5      3      2
## loc7       0      0      1      2      3      4      6      7      3
## loc8       1      0      0      2      2      4      5      6      4
## loc9       1      1      0      0      2      2      4      4      5
## loc10      1      0      0      0      1      2      2      3      5
##       plant10 plant11 plant12 plant13 plant14 plant15
## loc1        0       0       0       1       0       0
## loc2        0       0       2       0       0       1
## loc3        1       0       0       1       0       0
## loc4       30       0      10      10       0      10
## loc5        0       0       0       0       1       1
## loc6        1       0       0       0       0       1
## loc7        2       2       0       0       1       0
## loc8        4       4       2       0       1       0
## loc9        4       3       2       1       0       1
## loc10       5       4       3       2       1       1
reslake=dudi.coa(lakelike,scannf=FALSE,nf=2)
reslake2=dudi.pca(lakelike,scannf=FALSE,nf=2)
reslakeh=dudi.coa(lakelikeh,scannf=FALSE,nf=2)
reslakeh2=dudi.pca(lakelikeh,scannf=FALSE,nf=2)

Compare output from PCA and Correspondence Analysis

First Data Set

Principal Components

scatter(reslake2)

s.label(reslake2$li)

Correspondence Analysis

scatter(reslake)

s.label(reslake$li)

Second Data set

Principal Components

s.label(reslakeh2$co)

s.label(reslakeh2$li)

Correspondence Analysis

s.label(reslakeh$li)

s.label(reslakeh$co)

s.label(reslakeh$li)

s.label(reslakeh2$li)

Other methods for gradients

We will follow an analysis of Moignard et al., 2015.

This paper describes the dynamics of blood cell development. The data are single cell gene expression measurements of 3,934 cells with blood and endothelial potential from five populations from between embryonic day (E)7.0 and E8.25.

Cell tree


The five cell populations studied here are representative of three sequential states (PS, NP, HF) and two possible final branches (4SG and 4SFG\(^{-}\)).

## [1] 3934   46
## typesort
## sortA sortB 
##  3175   759

The classical multidimensional scaling on two distance matrices can be carried out using:

dist2n.euclid=dist(Norm)
dist1n.l1=dist(Norm,"manhattan")
cellsn.cmds=cmdscale(dist1n.l1,k=20,eig=TRUE)
cellsn2.cmds=cmdscale(dist2n.euclid,k=20,eig=TRUE)
perc1=round(100*sum(cellsn.cmds$eig[1:2])/sum(cellsn.cmds$eig))
perc2=round(100*sum(cellsn2.cmds$eig[1:2])/sum(cellsn2.cmds$eig))

We look at the underlying dimension and see below that two dimensions can provide a substantial percentage of the variance.

barplot(100*cellsn.cmds$eig[1:20]/sum(cellsn.cmds$eig))

barplot(100*cellsn2.cmds$eig[1:20]/sum(cellsn2.cmds$eig))

cellsmds=data.frame(Axis1=cellsn.cmds$points[,1],Axis2=cellsn.cmds$points[,2])
cells2mds=data.frame(Axis1=cellsn2.cmds$points[,1],Axis2=cellsn2.cmds$points[,2])
ggplot(cellsmds,aes(x=Axis1,y=Axis2))+
  geom_point(aes(color=celltypes),color=cellcol) +xlab("Axis1-CMDS") +ylab("Axis2")

ggplot(cells2mds,aes(x=Axis1,y=Axis2))+
  geom_point(aes(color=celltypes),color=cellcol) +xlab("Axis1-CMDS2") +ylab("Axis2")

Nonlinear methods

The cells are not distributed uniformly in the lower dimensions we have been looking at; they form a horseshoe.

Finding Time

Horseshoes: A la recherche

Multidimensional scaling and nonmetric multidimensional scaling aim to represent all distances as precisely as possible, and the large distances between faraway points skew the representations.

It can be beneficial when looking for gradients or low dimensional manifolds to restrict ourselves to approximations of points that are close together.

These methods try to represent local (small) distances well and do not try to approximate distances between faraway points with too much accuracy.

There has been substantial progress in such methods in recent years. The use of kernels computed from the calculated interpoint distances allows us to decrease the importance of points that are far apart. A radial basis kernel can be of the form

\[1-\exp\{-\frac{d(x,y)^2}{\sigma^2}\} ,\mbox{ where } \sigma^2 \mbox{ is fixed.}\] or the l1 version: \[1-\exp\{-\frac{d(x,y)}{\sigma}\} ,\mbox{ where } \sigma \mbox{ is fixed.}\]
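A quick sketch of what these kernels do to a set of interpoint distances (using the iris measurements and an arbitrary choice of \(\sigma\)):

```r
# Sketch: kernelized distances saturate for faraway pairs of points
d <- as.matrix(dist(scale(iris[, 1:4])))
sigma <- 1                        # sigma fixed arbitrarily for illustration
kl2 <- 1 - exp(-d^2 / sigma^2)    # radial basis (l2) version
kl1 <- 1 - exp(-d / sigma)        # l1 version
# both lie in [0, 1): all large original distances map close to 1,
# so points that are far apart no longer dominate the embedding
range(kl2); range(kl1)
```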

t-SNE

We add flexibility to the kernel defined above by allowing the \(\sigma^2\) parameter to vary locally. These kernel values are then normalized so they sum to one, giving a probability distribution \(P\) that pairs of points in the high dimensional space are neighbors. The t-SNE method then constructs points \(y_i\) in low dimensions whose neighbor probabilities \(Q\) are proportional to \((1+\|y_i-y_j\|^2)^{-1}\), minimizing the (non-symmetric) Kullback-Leibler divergence of \(Q\) from \(P\). This method is not robust and tends to separate clusters of points artificially; this can clarify a complex situation. One can think of it as akin to network layout algorithms: they stretch the data to clarify relations, but distances cannot be interpreted on the same scale at different points in the plane.

Here is an example of the output of t-SNE on the cell data:

library("Rtsne")
restsne = Rtsne(Norm, dims = 2, perplexity=30, verbose=TRUE, max_iter = 900)
## Read the 3934 x 46 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 3934
## Done in 1.23 seconds (sparsity = 0.032899)!
## Learning embedding...
## Iteration 50: error is 85.360702 (50 iterations in 2.13 seconds)
## Iteration 100: error is 73.795898 (50 iterations in 2.10 seconds)
## Iteration 150: error is 73.001769 (50 iterations in 2.03 seconds)
## Iteration 200: error is 72.780631 (50 iterations in 2.10 seconds)
## Iteration 250: error is 72.683101 (50 iterations in 2.18 seconds)
## Iteration 300: error is 2.392739 (50 iterations in 2.03 seconds)
## Iteration 350: error is 2.097175 (50 iterations in 2.12 seconds)
## Iteration 400: error is 1.951532 (50 iterations in 2.16 seconds)
## Iteration 450: error is 1.861258 (50 iterations in 2.22 seconds)
## Iteration 500: error is 1.803748 (50 iterations in 2.21 seconds)
## Iteration 550: error is 1.763741 (50 iterations in 2.24 seconds)
## Iteration 600: error is 1.738189 (50 iterations in 2.25 seconds)
## Iteration 650: error is 1.722425 (50 iterations in 2.26 seconds)
## Iteration 700: error is 1.711040 (50 iterations in 2.36 seconds)
## Iteration 750: error is 1.702465 (50 iterations in 2.29 seconds)
## Iteration 800: error is 1.694827 (50 iterations in 2.26 seconds)
## Iteration 850: error is 1.689483 (50 iterations in 2.30 seconds)
## Iteration 900: error is 1.684724 (50 iterations in 2.32 seconds)
## Fitting performed in 39.56 seconds.
dftsne=data.frame(taxis1=restsne$Y[,1],taxis2=restsne$Y[,2])
ggplot(dftsne,aes(x=taxis1,y=taxis2),group=celltypes)+
  geom_point(aes(color=celltypes)) +xlab("t-Axis1") +ylab("t-Axis2")

Experimenting with t-sne: tsne

Multi-table methods

Mantel coefficient between two distances

The Mantel coefficient, one of the earliest versions of matrix association, is probably also the most popular, especially in ecology. Given arbitrary dissimilarity matrices, it is defined as: \[ \mbox{r}_{\rm m}(\mathbf{X},\mathbf{Y})= \frac{\sum_{i=1}^n \sum_{j=1,j \neq i}^n (d_{ij}^{\mathbf{X}}-\bar d^{\mathbf{X}})(d_{ij}^{\mathbf{Y}}-\bar d^{\mathbf{Y}})}{\sqrt{\sum_{i,j,j \neq i}(d_{ij}^{\mathbf{X}}-\bar d^{\mathbf{X}})^2 \sum_{i,j,j \neq i}(d_{ij}^{\mathbf{Y}}-\bar d^{\mathbf{Y}})^2}},\] with \(\bar d^{\mathbf{X}}\) (resp. \(\bar d^{\mathbf{Y}}\)) the mean of the upper diagonal terms of the dissimilarity matrix associated to \(\mathbf{X}\) (resp. to \(\mathbf{Y}\)). This is the correlation coefficient between the vectors gathering the upper diagonal terms of the two dissimilarity matrices. Its significance is assessed via a permutation test. The coefficient and its test are implemented in several R packages such as ade4, vegan and ecodist.
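Since the Mantel coefficient is just the correlation between the vectorized dissimilarities, it can be sketched in a few lines; here we compare sepal-based and petal-based distances on the iris data (an illustrative choice, not from the text), leaving the permutation test to vegan::mantel:

```r
# Sketch: Mantel coefficient = correlation of the off-diagonal distances
dx <- dist(scale(iris[, 1:2]))   # distances from sepal measurements
dy <- dist(scale(iris[, 3:4]))   # distances from petal measurements
r_m <- cor(as.vector(dx), as.vector(dy))
r_m
# significance would be assessed by permutation, e.g. vegan::mantel(dx, dy)
```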

Inertia, Co-Inertia and the RV coefficient

As in physics, we define inertia as a weighted sum of distances of weighted points.

This enables us to use abundance data in a contingency table and compute its inertia, which in this case will be the weighted sum of the squares of distances between observed and expected frequencies, as used in computing the chi-square statistic.

Another generalization of variance-inertia is the useful phylogenetic diversity index (the sum of distances between a subset of taxa through the tree).

We also have such generalizations that cover the variability of points on a graph, taken from standard spatial statistics.

When studying two standardized variables measured at the same 10 locations, for instance pH and humidity, the standard quantification of their covariation is the covariance: \[\operatorname{cov}(x,y)= x_1 y_1+x_2 y_2+x_3 y_3 + \cdots + x_{10} y_{10}\] (up to a factor \(1/(n-1)\)); if \(x\) and \(y\) co-vary in the same direction this will be large. We already saw that \[\operatorname{cor}(x,y)=\frac{\operatorname{cov}(x,y)}{\sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)}}.\]
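A quick numerical check of these formulas, with two simulated standardized variables at 10 locations (the values stand in for pH and humidity):

```r
# Sketch: covariance and correlation of two standardized variables
set.seed(7)
ph   <- as.vector(scale(rnorm(10)))   # standardized: mean 0, variance 1
humi <- as.vector(scale(rnorm(10)))
sum(ph * humi) / 9    # the sum-of-products formula, scaled by n - 1
cov(ph, humi)         # same value
cor(ph, humi)         # equal to the covariance, since both variances are 1
```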

A simple generalization of this measure of covariation to two whole matrices is provided by Co-Inertia Analysis (CIA).

RV coefficient. The global measure of similarity of two data tables, as opposed to two vectors, can be obtained by a generalization of covariance provided by an inner product between tables. This gives the RV coefficient, a number between 0 and 1, like a correlation coefficient, but for tables: \[RV(A,B)=\frac{\operatorname{Tr}(AA'BB')}{\sqrt{\operatorname{Tr}\left((AA')^2\right)}\sqrt{\operatorname{Tr}\left((BB')^2\right)}}\] There are several other measures of matrix correlation available in other R packages.
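A minimal sketch of the RV coefficient written directly from the trace formula, using \(\operatorname{Tr}(W_1 W_2)=\sum_{ij}(W_1)_{ij}(W_2)_{ij}\) for symmetric matrices; the iris columns are just an illustrative pair of tables:

```r
# Sketch: RV coefficient between two tables with matching rows
rv <- function(A, B) {
  A <- scale(as.matrix(A), scale = FALSE)   # column-center each table
  B <- scale(as.matrix(B), scale = FALSE)
  W1 <- tcrossprod(A)                       # AA'
  W2 <- tcrossprod(B)                       # BB'
  sum(W1 * W2) / sqrt(sum(W1^2) * sum(W2^2))
}
rv(iris[, 1:2], iris[, 3:4])   # a value between 0 and 1
rv(iris[, 1:2], iris[, 1:2])   # 1 for identical tables
```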

If we do ascertain a link between two matrices, we can then look for the structure they share; this is the goal of canonical correlation analysis.

Canonical Correlation Analysis (CCA)

CCA (Hotelling, 1938) works with two sets of variables and its goal is to find a linear projection of the first set of variables that maximally correlates with a linear projection of the second set of variables.

Finding correlated functions (covariates) of the two views of the same phenomenon, by discarding the representation-specific details (noise), is expected to reveal the underlying hidden yet influential factors responsible for the correlation.

Canonical Correlation Algorithm

Let us consider two matrices \(X\) and \(Y\) of order \(n > p\) and \(n > q\) respectively.

The columns of \(X\) and \(Y\) correspond to variables and the rows correspond to experimental units.

The jth column of the matrix \(X\) is denoted by \(X_j\), likewise the kth column of \(Y\) is denoted by \(Y_k\). Without loss of generality it will be assumed that the columns of \(X\) and \(Y\) are standardized (mean 0 and variance 1). Furthermore, it is assumed that \(p \leq q\) (in other words, the group which contains the fewest variables is denoted by \(X\)). We denote by \(S_{XX}\) and \(S_{YY}\) the sample covariance matrices for variable sets \(X\) and \(Y\) respectively, and by \(S_{XY} = S_{YX}^T\) the sample cross-covariance matrix between \(X\) and \(Y\).

Classical CCA assumes first that \(p \leq n\) and \(q \leq n\), then that the matrices \(X\) and \(Y\) are of full column rank \(p\) and \(q\) respectively. In the following, the principle of CCA is presented as a problem solved through an iterative algorithm. The first stage of CCA consists in finding two vectors \(a =(a_1,...,a_p)^T\) and \(b =(b_1,...,b_q)^T\) that maximize the correlation between the linear combinations \[\begin{eqnarray*} U=Xa&=&a_1 X_1 +a_2 X_2 +\cdots + a_p X_p\\ V=Yb&=&b_1 Y_1 +b_2 Y_2 +\cdots + b_q Y_q\\ \end{eqnarray*}\]

and assuming that vectors a and b are normalized so that \[var(U) = var(V) = 1 .\] In other words, the problem consists in solving \[\rho_1=cor(U,V)=\max_{a,b}\; cor(Xa,Yb) \mbox{ subject to } var(Xa)=var(Yb)=1\]

The resulting variables U and V are called the first canonical variates and \(\rho_1\) is referred to as the first canonical correlation.

Higher order canonical variates and canonical correlations can be found as a stepwise problem. For \(s = 1,...,p\), we can successively find positive correlations \(\rho_1 \geq \rho_2 \geq\ldots \geq \rho_p\) with corresponding vectors \((a^1, b^1), \ldots, (a^p, b^p)\), by maximizing \[\rho_s=cor(U^s,V^s)=\max_{a^s,b^s}\; cor(Xa^s,Yb^s)\qquad \mbox{ subject to } var(Xa^s)=var(Yb^s)=1\] under the additional restrictions \[cor(U^s,U^t)=cor(V^s,V^t)=0, \mbox{ for } 1\leq t < s \leq p.\]
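In the unpenalized case, this stepwise problem is solved by `stats::cancor`. A small sketch on simulated data (the variables and the induced correlation are illustrative):

```r
# Classical CCA on simulated data with stats::cancor; the first
# canonical variates U1 = X a^1 and V1 = Y b^1 attain rho_1.
set.seed(2)
n = 50
X = matrix(rnorm(n * 3), n, 3)
# make the first column of Y correlated with the first column of X
Y = cbind(X[, 1] + rnorm(n, sd = 0.5), matrix(rnorm(n * 2), n, 2))
cc = cancor(X, Y)
cc$cor  # canonical correlations rho_1 >= rho_2 >= rho_3
U1 = scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1]
V1 = scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1]
abs(cor(U1, V1))  # matches cc$cor[1] (up to the sign of the coefficients)
```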

Sparse Canonical Correlation Analysis (sCCA)

When the number of variables in each table is very large it is beneficial to add a penalty that keeps the number of non-zero coefficients to a minimum. This approach is called sparse canonical correlation analysis (sparse CCA or sCCA), a method well-suited to both exploratory comparisons between samples and the identification of features with interesting variation. We will use an implementation from the PMA package.
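Before turning to the real data, here is a toy sketch of the PMA interface on simulated matrices; the penalty value 0.3 is arbitrary, chosen only for illustration:

```r
# Toy sketch of sparse CCA with PMA::CCA on simulated matrices.
# Assumes the PMA package is installed; penalties are illustrative.
library("PMA")
set.seed(4)
x = matrix(rnorm(20 * 30), 20, 30)
z = matrix(rnorm(20 * 40), 20, 40)
out = CCA(x, z, penaltyx = 0.3, penaltyz = 0.3, trace = FALSE)
sum(out$u != 0)  # number of selected columns of x
sum(out$v != 0)  # number of selected columns of z
```

Lowering the penalties zeroes out more coefficients, i.e. selects fewer features.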

Here we study a dataset with two tables: one contingency table of bacterial abundances and another of metabolites. There are 12 samples; the metabolite table has measurements on 637 features, and the bacterial abundance table a total of 20,609 OTUs, which we will filter down.

library("dplyr")
library("phyloseq")
metab_path = "../../Books/CUBook/data/metabolites.csv"
load("../../Books/CUBook/data/microbe.rda")
metab   = read.csv(metab_path, row.names = 1) %>% as.matrix

We first filter down to bacteria and metabolites of interest, removing those that are zero across many samples. Then, we transform them to weaken the heavy tails.

library("genefilter")
metab   = metab[rowSums(metab == 0) <= 3, ]
microbe = prune_taxa(taxa_sums(microbe) > 4, microbe)
microbe = filter_taxa(microbe, filterfun(kOverA(3, 2)), TRUE)
metab = log(1 + metab, base = 10)
X = as.matrix(otu_table(microbe))
X[X > 50] = 50

To look at whether there is any association between the two matrices, we first have to match the sample dimensions of the two matrices:

library("ade4")
colnames(X)=colnames(metab)
pca1 = dudi.pca(t(metab), scal = TRUE, scann = FALSE)
pca2 = dudi.pca(t(X), scal = TRUE, scann = FALSE)
rv1 = RV.rtest(pca1$tab, pca2$tab, 999)
rv1
## Monte-Carlo test
## Call: RV.rtest(df1 = pca1$tab, df2 = pca2$tab, nrepet = 999)
## 
## Observation: 0.8190389 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: greater 
## 
##     Std.Obs Expectation    Variance 
## 6.224494576 0.347934842 0.005728296

We can now apply sparse CCA. This method compares sets of features across high-dimensional data tables, where there may be more measured features than samples. In the process, it chooses a subset of available features that capture the most covariance – these are the features that reflect signals present across multiple tables.

Our implementation is below. The parameters penaltyx and penaltyz are sparsity penalties. Smaller values of penaltyx will result in fewer selected microbes; similarly, penaltyz modulates the number of selected metabolites. We tune them manually to facilitate subsequent interpretation – we generally prefer more sparsity than the default parameters would provide.

library("PMA")
ccaRes = CCA(t(X), t(metab), penaltyx = 0.15, penaltyz = 0.15)
## 123456789101112131415
ccaRes
## Call: CCA(x = t(X), z = t(metab), penaltyx = 0.15, penaltyz = 0.15)
## 
## 
## Num non-zeros u's:  5 
## Num non-zeros v's:  15 
## Type of x:  standard 
## Type of z:  standard 
## Penalty for x: L1 bound is  0.15 
## Penalty for z: L1 bound is  0.15 
## Cor(Xu,Zv):  0.9735258

With these parameters, 5 bacteria and 15 metabolites were selected based on their ability to explain covariation between tables. Further, these 20 features result in a correlation of 0.974 between the two tables.

We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the 20 selected features.

Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well – sparse CCA has minimized redundancy across features, but makes no guarantee that these are the "true" features in any sense.

We can still use these 20 features to compress information from the two tables without much loss.

To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA.

combined = cbind(t(X[ccaRes$u != 0, ]),
                 t(metab[ccaRes$v != 0, ]))
pcaRes = dudi.pca(combined, scannf = FALSE, nf = 3)

# annotation
genotype     = substr(rownames(pcaRes$li), 1, 2)
sampleType  = substr(rownames(pcaRes$l1), 3, 4)
featureType = grepl("\\.", colnames(combined))
featureType = ifelse(featureType, "Metabolite", "OTU")

sampleInfo  = data.frame(pcaRes$li, genotype, sampleType)
featureInfo = data.frame(pcaRes$c1,
                          feature = substr(colnames(combined), 1, 6))

We make a PCA biplot, where we show the different types of samples and the multidomain features (metabolites and OTUs).

This allows comparison across the measured samples – triangles for knockout and circles for wild type – and characterizes the influence of the different features – diamonds with text labels.

For example, we see that the main variation in the data is across PD and ST samples, which correspond to the different diets. Further, large values of 15 of the features are associated with ST status, while small values for 5 of them indicate PD status. The advantage of the sparse CCA screening is now clear – we can display most of the variation across samples using a relatively simple plot, and can avoid plotting the hundreds of additional points that would be needed to display all of the features.

library("ggrepel")
ggplot() +  geom_point(data = sampleInfo,
  aes(x = Axis1, y = Axis2, col = sampleType, shape = genotype), size = 3) +
  geom_label_repel(data = featureInfo,
  aes(x = 5.5 * CS1, y = 5.5 * CS2, label = feature, fill = featureType),
      size = 2, segment.size = 0.3,
      label.padding = unit(0.1, "lines"), label.size = 0) +
  geom_point(data = featureInfo,
             aes(x = 5.5 * CS1, y = 5.5 * CS2, fill = featureType),
             size = 1, shape = 23, col = "#383838") +
  scale_color_brewer(palette = "Set2") +
  scale_fill_manual(values = c("#a6d854", "#e78ac3")) +
  guides(fill = guide_legend(override.aes = list(shape = 32, size = 0))) +
  coord_fixed()+
  labs(x = sprintf("Axis1 [%s%% Variance]",
                   100 * round(pcaRes$eig[1] / sum(pcaRes$eig), 2)),
       y = sprintf("Axis2 [%s%% Variance]",
                   100 * round(pcaRes$eig[2] / sum(pcaRes$eig), 2)),
       fill = "Feature Type", col = "Sample Type")

Multivariate Analysis: A summary

Graphical and Interpretational Aides